Information and analysis obtained in Phase 1 will feed into this further investigation of the multiple linear regression problem, where statistical modelling is used to evaluate the relationship between each predictor variable and the response variable (diamond price) while controlling for the potential influence of the other diamond variables. This will help us predict diamond prices strategically from those variables in Phase 2.
After defining the regression model formula, we fit an ordinary least squares (OLS) model to the encoded features. The OLS regression results table contains a wealth of information about the fit, such as the dependent variable, the number of observations, the residual degrees of freedom, the R-squared and adjusted R-squared values, the F-statistic, and, for each term, the coefficient estimate, standard error, t-statistic, p-value, and confidence interval. To better understand the difference between the actual and predicted diamond prices, we then build a new data frame holding the actual prices, the predicted prices, and the residuals.
Next, we check whether the multiple linear regression model is valid by running diagnostic checks against four assumptions: linearity, nearly normal residuals, constant variability, and independence of residuals. In backwards feature selection, we eliminate features that are not statistically significant, i.e. features whose p-value is above 0.05, since statistical significance is conventionally indicated by a p-value below 0.05. We then create a new OLS model and a data frame storing the actual prices, predicted prices, and residuals, and validate whether the reduced model satisfies the same four assumptions in our Reduced Model Overview and Reduced Model Diagnostic Checks. This mirrors what we did in the Full Model Overview and Full Model Diagnostic Checks, but with the insignificant features eliminated.
To adapt our methodology for estimating diamond prices, we first review our Phase 1 report and rename any columns that were left unaltered in the Data Cleaning and Preprocessing section. After that, we can derive the statistical model formula for the diamond dataset. We then create a new data frame called residuals_full to compare the actual and predicted prices and to hold the residuals of the full model. This data frame is used to plot the predicted values and the regression residuals.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
df = pd.read_csv('Phase2_Group15.csv')
df.head()
| | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 3.95 | 3.98 | 2.43 | 326 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 3.89 | 3.84 | 2.31 | 326 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 4.05 | 4.07 | 2.31 | 327 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 4.20 | 4.23 | 2.63 | 334 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 4.34 | 4.35 | 2.75 | 335 |
To estimate the price of diamonds, we first construct a multiple linear regression using all the features in this dataset. However, since we did not rename some of the columns in the Data Cleaning and Preprocessing section of our Phase 1 report, we must do so before continuing with the full model overview.
df = df.rename(columns={"depth": "total_depth_percentage", "x": "length", "y": "width", "z": "depth"})
Next, we get the statistical model formula of this dataset in a Python string form.
formula_string_vars = ' + '.join(df.drop(columns='price').columns)
formula_string = 'price ~ ' + formula_string_vars
print('formula_string: ', formula_string)
formula_string: price ~ carat + cut + color + clarity + total_depth_percentage + table + length + width + depth
Then, we create a new formula string containing the encoded features, using the get_dummies() function to one-hot encode the categorical features. But first, we need to replace the whitespace in the cut quality feature with an underscore, because it contains a category called "Very Good" and the whitespace would cause the Statsmodels formula parser to raise an "invalid syntax" error.
# Replace whitespace with an underscore in every categorical column
categoricalColumns = df.columns[df.dtypes==object].tolist()
for col in categoricalColumns:
df[col] = df[col].str.replace(' ', '_')
data_encoded = pd.get_dummies(df, drop_first=True)
data_encoded.head()
| | carat | total_depth_percentage | table | length | width | depth | price | cut_Good | cut_Ideal | cut_Premium | cut_Very_Good | color_E | color_F | color_G | color_H | color_I | color_J | clarity_IF | clarity_SI1 | clarity_SI2 | clarity_VS1 | clarity_VS2 | clarity_VVS1 | clarity_VVS2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 3.95 | 3.98 | 2.43 | 326 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0.21 | 59.8 | 61.0 | 3.89 | 3.84 | 2.31 | 326 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.23 | 56.9 | 65.0 | 4.05 | 4.07 | 2.31 | 327 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0.29 | 62.4 | 58.0 | 4.20 | 4.23 | 2.63 | 334 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.31 | 63.3 | 58.0 | 4.34 | 4.35 | 2.75 | 335 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
formula_string_vars_encoded = ' + '.join(data_encoded.drop(columns='price').columns)
formula_string_encoded = 'price ~ ' + formula_string_vars_encoded
print('formula_string_encoded: ', formula_string_encoded)
formula_string_encoded: price ~ carat + total_depth_percentage + table + length + width + depth + cut_Good + cut_Ideal + cut_Premium + cut_Very_Good + color_E + color_F + color_G + color_H + color_I + color_J + clarity_IF + clarity_SI1 + clarity_SI2 + clarity_VS1 + clarity_VS2 + clarity_VVS1 + clarity_VVS2
Now that we have defined our statistical model formula as a Python string, we fit an OLS (ordinary least squares) model to our encoded data with "price" being the dependent variable.
full_model = sm.formula.ols(formula=formula_string_encoded, data=data_encoded)
full_model_fitted = full_model.fit()
print(full_model_fitted.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.920
Model: OLS Adj. R-squared: 0.920
Method: Least Squares F-statistic: 2.688e+04
Date: Sat, 22 Oct 2022 Prob (F-statistic): 0.00
Time: 15:45:19 Log-Likelihood: -4.5573e+05
No. Observations: 53940 AIC: 9.115e+05
Df Residuals: 53916 BIC: 9.117e+05
Df Model: 23
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 2184.4774 408.197 5.352 0.000 1384.409 2984.546
carat 1.126e+04 48.628 231.494 0.000 1.12e+04 1.14e+04
total_depth_percentage -63.8061 4.535 -14.071 0.000 -72.694 -54.918
table -26.4741 2.912 -9.092 0.000 -32.181 -20.767
length -1008.2611 32.898 -30.648 0.000 -1072.741 -943.781
width 9.6089 19.333 0.497 0.619 -28.284 47.502
depth -50.1189 33.486 -1.497 0.134 -115.752 15.515
cut_Good 579.7514 33.592 17.259 0.000 513.911 645.592
cut_Ideal 832.9118 33.407 24.932 0.000 767.433 898.391
cut_Premium 762.1440 32.228 23.649 0.000 698.978 825.310
cut_Very_Good 726.7826 32.241 22.542 0.000 663.591 789.975
color_E -209.1181 17.893 -11.687 0.000 -244.189 -174.047
color_F -272.8538 18.093 -15.081 0.000 -308.316 -237.392
color_G -482.0389 17.716 -27.209 0.000 -516.763 -447.315
color_H -980.2667 18.836 -52.043 0.000 -1017.185 -943.348
color_I -1466.2445 21.162 -69.286 0.000 -1507.723 -1424.766
color_J -2369.3981 26.131 -90.674 0.000 -2420.615 -2318.181
clarity_IF 5345.1022 51.024 104.757 0.000 5245.095 5445.110
clarity_SI1 3665.4721 43.634 84.005 0.000 3579.949 3750.995
clarity_SI2 2702.5863 43.818 61.677 0.000 2616.702 2788.471
clarity_VS1 4578.3979 44.546 102.779 0.000 4491.087 4665.708
clarity_VS2 4267.2236 43.853 97.306 0.000 4181.270 4353.177
clarity_VVS1 5007.7590 47.160 106.187 0.000 4915.326 5100.192
clarity_VVS2 4950.8141 45.855 107.967 0.000 4860.938 5040.690
==============================================================================
Omnibus: 14433.356 Durbin-Watson: 1.183
Prob(Omnibus): 0.000 Jarque-Bera (JB): 565680.446
Skew: 0.577 Prob(JB): 0.00
Kurtosis: 18.823 Cond. No. 7.14e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.14e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
The full model's adjusted R-squared value is 0.920, which indicates that the model accounts for 92% of the variance in price. Looking at the p-values, we observe that the majority of terms are highly significant, though there is one insignificant variable, width: its p-value of 0.619 means we cannot reject the null hypothesis that its coefficient is zero, so there is no evidence that it has an impact on the dependent/target variable, price. To compare the actual and predicted prices and inspect the residuals of the full model, we now construct a new data frame named residuals_full. This data frame will be used to plot the predicted values and the regression residuals.
residuals_full = pd.DataFrame({'actual': data_encoded['price'],
'predicted': full_model_fitted.fittedvalues,
'residual': full_model_fitted.resid})
residuals_full.head(10)
| | actual | predicted | residual |
|---|---|---|---|
| 0 | 326 | -1346.364288 | 1672.364288 |
| 1 | 326 | -664.595411 | 990.595411 |
| 2 | 327 | 211.107106 | 115.892894 |
| 3 | 334 | -830.737177 | 1164.737177 |
| 4 | 335 | -3459.224220 | 3794.224220 |
| 5 | 336 | -1380.487569 | 1716.487569 |
| 6 | 336 | -397.875201 | 733.875201 |
| 7 | 337 | -1073.323502 | 1410.323502 |
| 8 | 337 | -1040.023136 | 1377.023136 |
| 9 | 338 | -420.417884 | 758.417884 |
The scatter plot below shows the correlation between the actual and predicted price values, with the actual price on the x-axis and the predicted price on the y-axis. We can observe from Figure 1 below that the model makes several price predictions above 20,000, even though no diamond in the dataset is priced higher than that.
def plot_line(axis, slope, intercept, **kargs):
xmin, xmax = axis.get_xlim()
plt.plot([xmin, xmax], [xmin*slope+intercept, xmax*slope+intercept], **kargs)
# Creating scatter plot
plt.figure(figsize=(15, 10))
plt.scatter(residuals_full['actual'], residuals_full['predicted'], alpha=0.3)
plt.xlim(-1000, 20000)
plt.ylim(-7000, 45000)
plot_line(axis=plt.gca(), slope=1, intercept=0, c="blue")
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Figure 1: Scatter Plot of Actual VS. Predicted Price for the Full Model', fontsize=15)
plt.show()
According to the chart above, the model's actual and predicted values are strongly correlated, and because most data points lie quite close to the reference line, we may conclude that the regression model fits the data reasonably well.
A multiple linear regression model has 4 conditions to satisfy:

1. Linearity
2. Nearly normal residuals
3. Constant variability
4. Independence of residuals

Diagnostic plots are used below to determine the validity of the model.
In Figure 2, the residual plot compares the residuals against the predicted price. Since this is a multiple linear regression problem, comparing the residuals against each explanatory variable individually would not give a full overview of the dataset. The residual plot shows that the constant variability condition is not satisfied for this full model: the residuals are not centred around 0 and instead exhibit a negative trend, with the residual value decreasing as the predicted price increases. Note that the residuals are plotted against the predicted price rather than against individual predictors to avoid possible complications, such as collinearity between the predictors, when checking linearity.
plt.figure(figsize=(15,10))
sns.residplot(x = residuals_full['predicted'], y = residuals_full['residual'])
plt.xlabel('Predicted Price', fontsize = 16)
plt.ylabel('Residuals', fontsize = 16)
plt.title('Figure 2: Scatterplot of Predicted Price VS. Residuals for the Full Model', fontsize=15)
plt.show();
In Figure 3, the residual plot compares the actual diamond price against the residuals for the full model. Notably, instead of being dispersed around 0 in a horizontal band, the residuals exhibit a distinctive curvature with an increasing trend, making the constant variance assumption unlikely to hold for this multiple linear regression model. There is also an extreme outlier (residual below -15,000 at an actual price above USD 17,500) at the bottom-right corner of the graph, which is a significant concern for this dataset.
plt.figure(figsize=(15,10))
sns.residplot(x = residuals_full['actual'], y = residuals_full['residual'])
plt.xlabel('Actual Price', fontsize = 16)
plt.ylabel('Residuals', fontsize = 16)
plt.title('Figure 3: Residual Plot of Actual Price VS. Residuals for the Full Model', fontsize=15)
plt.show();
In Figure 4, the histogram of the actual price illustrates that its distribution is significantly right-skewed, meaning more diamonds fall within the USD 1 to USD 1,000 price range than any other. The predicted price distribution is similarly right-skewed. In practice, no diamond should fall into a negative price range, so the negative predictions represent extrapolation (predicted prices beyond the original price range) by the full model that we should take note of.
plt.figure(figsize=(15,10))
plt.hist(residuals_full['predicted'], label='Predicted', bins=20, alpha=0.7);
plt.hist(residuals_full['actual'], label='Actual', bins=20, alpha=0.7);
plt.xlabel('Actual Price');
plt.ylabel('Frequency');
plt.title('Figure 4: Histogram of Actual Price VS. Predicted Price for Full Model', fontsize=15);
plt.legend()
plt.show();
In theory, the distribution of the residuals should be nearly normal. In Figure 5, the histogram of residuals for the full model is nearly symmetric and consists of a single prominent peak (unimodal) falling just below zero, with no extreme outliers visible at this scale. Therefore, we judge that the residuals of the full model approximately satisfy the nearly normal residuals condition.
plt.figure(figsize=(15,10))
plt.hist(residuals_full['residual'], bins = 50);
plt.xlabel('Residual');
plt.ylabel('Frequency');
plt.title('Figure 5: Histogram of Residuals for Full Model', fontsize=15);
plt.show();
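The histogram above is one visual way to judge normality. A quantile-quantile (Q-Q) plot is a complementary check that was not part of our original workflow; a minimal self-contained sketch using scipy.stats.probplot on synthetic residuals is shown below (the real check would pass residuals_full['residual'] instead of the synthetic sample):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for residuals_full['residual'] (assumption: the real
# residuals would be passed here instead)
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0, scale=1000, size=5000)

# probplot compares sample quantiles with theoretical normal quantiles;
# the fit correlation r is close to 1 when the residuals are nearly normal
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist='norm')
print(f'Q-Q fit correlation: {r:.4f}')
```

A low fit correlation, or points curving away from the fitted line at the tails, would flag the heavy tails that the summary's high kurtosis (18.823) already hints at.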
The diamond dataset is not time-series data, meaning the observations of the diamonds' features were not obtained through repeated measurements over time. Hence, the observations in the dataset are independent of each other, which satisfies the independent residuals condition. In Figure 6, a scatter plot of the residuals against the order of data collection is presented; there is an increasing trend from roughly order 20,000 to 28,000, which suggests misfitting or anomalies within the full model.
plt.figure(figsize=(20,10))
sns.scatterplot(x = residuals_full['residual'].index, y = residuals_full['residual'], s = 2)
plt.xlabel('Order of Data Collection', fontsize = 16)
plt.ylabel('Residuals', fontsize = 16)
plt.title('Figure 6: Scatter Plot of Residuals VS. Order of Data Collection', fontsize=15)
plt.show();
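Besides the visual check above, the Durbin-Watson statistic already reported in the OLS summary (1.183 for the full model) quantifies residual autocorrelation: values near 2 indicate independent residuals, while values towards 0 indicate positive autocorrelation. A minimal sketch on synthetic data, using statsmodels' durbin_watson function:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)

# Independent residuals: statistic should land near 2
independent = rng.normal(size=10000)
dw_ind = durbin_watson(independent)
print(f'independent residuals: {dw_ind:.3f}')

# Strongly autocorrelated residuals (AR(1), rho = 0.9) for contrast:
# statistic should be far below 2
autocorr = np.empty(10000)
autocorr[0] = rng.normal()
for t in range(1, 10000):
    autocorr[t] = 0.9 * autocorr[t - 1] + rng.normal()
dw_auto = durbin_watson(autocorr)
print(f'autocorrelated residuals: {dw_auto:.3f}')
```

Our full model's value of 1.183 sits between these extremes, consistent with the mild trend visible in Figure 6.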
## create the patsy model description from formula
patsy_description = patsy.ModelDesc.from_formula(formula_string_encoded)
# initialize feature-selected fit to full model
linreg_fit = full_model_fitted
# do backwards elimination using p-values
p_val_cutoff = 0.05
print('\nPerforming backwards feature selection using p-values:')
while True:
pval_series = linreg_fit.pvalues.drop(labels='Intercept')
pval_series = pval_series.sort_values(ascending=False)
term = pval_series.index[0]
    pval = pval_series.iloc[0]
if (pval < p_val_cutoff):
break
term_components = term.split(':')
print(f'\nRemoving term "{term}" with p-value {pval:.4}')
if (len(term_components) == 1): ## this is a main effect term
patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0])]))
else: ## this is an interaction term
patsy_description.rhs_termlist.remove(patsy.Term([patsy.EvalFactor(term_components[0]),
patsy.EvalFactor(term_components[1])]))
linreg_fit = smf.ols(formula=patsy_description, data=data_encoded).fit()
###
## this is the clean fit after backwards elimination
model_reduced_fitted = smf.ols(formula = patsy_description, data = data_encoded).fit()
###
#########
print("\n***")
print(model_reduced_fitted.summary())
print("***")
print(f"Regression number of terms: {len(model_reduced_fitted.model.exog_names)}")
print(f"Regression F-distribution p-value: {model_reduced_fitted.f_pvalue:.4f}")
print(f"Regression R-squared: {model_reduced_fitted.rsquared:.4f}")
print(f"Regression Adjusted R-squared: {model_reduced_fitted.rsquared_adj:.4f}")
Performing backwards feature selection using p-values:
Removing term "width" with p-value 0.6192
Removing term "depth" with p-value 0.1488
***
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.920
Model: OLS Adj. R-squared: 0.920
Method: Least Squares F-statistic: 2.944e+04
Date: Sat, 22 Oct 2022 Prob (F-statistic): 0.00
Time: 15:45:27 Log-Likelihood: -4.5573e+05
No. Observations: 53940 AIC: 9.115e+05
Df Residuals: 53918 BIC: 9.117e+05
Df Model: 21
Covariance Type: nonrobust
==========================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------
Intercept 2366.0858 390.351 6.061 0.000 1600.995 3131.177
carat 1.126e+04 48.600 231.626 0.000 1.12e+04 1.14e+04
total_depth_percentage -66.7693 4.091 -16.322 0.000 -74.787 -58.752
table -26.4573 2.911 -9.089 0.000 -32.163 -20.752
length -1029.4779 20.549 -50.098 0.000 -1069.755 -989.201
cut_Good 580.2405 33.572 17.283 0.000 514.438 646.043
cut_Ideal 833.2603 33.396 24.951 0.000 767.804 898.716
cut_Premium 762.7586 32.225 23.670 0.000 699.598 825.919
cut_Very_Good 726.8201 32.212 22.564 0.000 663.685 789.955
color_E -209.2370 17.893 -11.694 0.000 -244.307 -174.167
color_F -272.8341 18.093 -15.080 0.000 -308.296 -237.372
color_G -481.9429 17.716 -27.204 0.000 -516.667 -447.219
color_H -980.1218 18.836 -52.035 0.000 -1017.040 -943.204
color_I -1466.1815 21.162 -69.283 0.000 -1507.660 -1424.703
color_J -2369.5038 26.131 -90.678 0.000 -2420.720 -2318.287
clarity_IF 5344.3381 51.015 104.761 0.000 5244.349 5444.327
clarity_SI1 3664.9053 43.627 84.005 0.000 3579.396 3750.415
clarity_SI2 2702.0771 43.812 61.674 0.000 2616.205 2787.949
clarity_VS1 4577.5892 44.535 102.786 0.000 4490.300 4664.879
clarity_VS2 4266.6117 43.847 97.308 0.000 4180.672 4352.551
clarity_VVS1 5007.0611 47.152 106.190 0.000 4914.643 5099.479
clarity_VVS2 4950.1680 45.847 107.972 0.000 4860.308 5040.028
==============================================================================
Omnibus: 14433.691 Durbin-Watson: 1.183
Prob(Omnibus): 0.000 Jarque-Bera (JB): 566407.977
Skew: 0.577 Prob(JB): 0.00
Kurtosis: 18.833 Cond. No. 6.81e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.81e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
***
Regression number of terms: 22
Regression F-distribution p-value: 0.0000
Regression R-squared: 0.9198
Regression Adjusted R-squared: 0.9198
From the code above, we eliminated the width and depth features because they had p-values of 0.6192 and 0.1488, respectively, and we then created a new OLS model from the cleaned fitted formula. As a result, we have completed the feature selection process by using backwards elimination to filter out features that were irrelevant to our model, as evidenced by the OLS Regression Results above, where no remaining feature has a p-value above 0.05.
formula_string_vars_encoded = ' + '.join(data_encoded.drop(columns=['price', 'width', 'depth']).columns)
formula_string_encoded = 'price ~ ' + formula_string_vars_encoded
print('formula_string_encoded: ', formula_string_encoded)
formula_string_encoded: price ~ carat + total_depth_percentage + table + length + cut_Good + cut_Ideal + cut_Premium + cut_Very_Good + color_E + color_F + color_G + color_H + color_I + color_J + clarity_IF + clarity_SI1 + clarity_SI2 + clarity_VS1 + clarity_VS2 + clarity_VVS1 + clarity_VVS2
residuals_reduced = pd.DataFrame({'actual': data_encoded['price'],
'predicted': model_reduced_fitted.fittedvalues,
'residual': model_reduced_fitted.resid})
residuals_reduced.head(10)
| | actual | predicted | residual |
|---|---|---|---|
| 0 | 326 | -1346.614726 | 1672.614726 |
| 1 | 326 | -662.895014 | 988.895014 |
| 2 | 327 | 215.495406 | 111.504594 |
| 3 | 334 | -830.942042 | 1164.942042 |
| 4 | 335 | -3460.396928 | 3795.396928 |
| 5 | 336 | -1382.081003 | 1718.081003 |
| 6 | 336 | -398.775782 | 734.775782 |
| 7 | 337 | -1073.647468 | 1410.647468 |
| 8 | 337 | -1044.665285 | 1381.665285 |
| 9 | 338 | -418.429847 | 756.429847 |
# get a scatter plot
plt.figure(figsize=(15,10))
plt.scatter(residuals_reduced['actual'], residuals_reduced['predicted'], alpha=0.3);
plot_line(axis=plt.gca(), slope=1, intercept=0, c="yellow");
plt.xlabel('Actual Price');
plt.ylabel('Predicted Price');
plt.title('Figure 7: Scatterplot of Actual VS. Predicted Price for Reduced Model', fontsize=15);
plt.show();
This model returns an adjusted R-squared of 0.92, similar to the full model's adjusted R-squared, indicating that 92% of the variance is still explained by the reduced model even with two fewer variables. The p-values show that, as expected, all remaining terms are significant at the 5% level as a result of the backwards feature selection. Figure 7 shows that our model still has the same problem: it overestimates the prices of expensive diamonds while underestimating the prices of cheaper diamonds. The diagnostic checks will now be run on this reduced model.
plt.figure(figsize=(15,10))
plt.scatter(residuals_reduced['predicted'], residuals_reduced['residual'], alpha=0.3);
plt.xlabel('Predicted Price');
plt.ylabel('Residuals')
plt.title('Figure 8: Scatter plot of Residuals VS. Predicted Price for Reduced Model', fontsize=15)
plt.show();
Figure 2 and Figure 8 bear a striking resemblance, indicating that the residuals show the same banding pattern even though the residual values differ slightly between the full and reduced models. Next, we will create a scatter plot of the residuals vs. the actual price, a histogram of the actual and predicted prices, and a histogram of the residuals for the reduced model.
plt.figure(figsize=(15,10))
sns.residplot(x = residuals_reduced['actual'], y = residuals_reduced['residual'])
plt.xlabel('Actual Price', fontsize = 16)
plt.ylabel('Residuals', fontsize = 16)
plt.title('Figure 9: Residual Plot of Actual Price VS. Residuals for the Reduced Model', fontsize=15)
plt.show();
From Figure 9 above, notice that most of the points are scattered around zero. However, as the price increases, more outliers appear, which could be a violation of constant variability.
plt.figure(figsize=(15,10))
plt.hist(residuals_reduced['predicted'], label='Predicted', bins=20, alpha=0.7);
plt.hist(residuals_reduced['actual'], label='Actual', bins=20, alpha=0.7);
plt.xlabel('Actual Price');
plt.ylabel('Frequency');
plt.title('Figure 10: Histogram of Actual Price VS. Predicted Price for Reduced Model', fontsize=15);
plt.legend()
plt.show();
plt.figure(figsize=(15,10))
plt.hist(residuals_reduced['residual'], bins = 50);
plt.xlabel('Residual');
plt.ylabel('Frequency');
plt.title('Figure 11: Histogram of Residuals for Reduced Model', fontsize = 15)
plt.show();
All three of the graphs above are strikingly similar to the corresponding graphs for the full model, even though two features have been removed. This suggests that neither width nor depth has a large impact on determining a diamond's price. From Figure 11, we can see that the histogram of the residuals for the reduced model is unimodal, with a peak frequency above 30,000, slightly skewed to the right but almost symmetrical. The histogram also indicates that the residuals are distributed nearly evenly around zero, which suggests the normality assumption is likely to hold.
pandas.get_dummies() ensures the modelling process behaves appropriately: if categorical values were encoded as plain integers, the model would treat their order as meaningful, with a larger number carrying more weight than a smaller one. However, these variables do not contain any inherent ranking, so integer encoding could lead to poor predictions and performance. One-hot encoding instead adds a binary variable for each unique category, which makes the dataset more expressive and useful. Using the statsmodels.formula.api ols function to implement the ordinary least squares method for this multiple linear regression problem allows us to establish and examine the relationship between the response variable (the diamond's price in USD) and multiple explanatory variables. This method determines the intercept and slope values that minimise the total sum of squared residuals. The summary presents the values of the necessary metrics, in particular the adjusted R-squared. Note, however, that the plain R-squared value either stays the same or increases as additional variables are included in the model, so a model can appear relatively accurate even when some variables contribute poorly or not at all to the results. This is why we conduct feature selection, to eliminate variables that have minimal influence in the model.
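The adjusted R-squared counters this inflation by penalising each added term. A minimal sketch of the formula, plugging in the n (observations) and p (model degrees of freedom) reported in the summary tables above:

```python
# Adjusted R-squared: adj_R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors.
def adjusted_r_squared(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Full model: R2 = 0.920, n = 53940, p = 23 terms
adj_full = adjusted_r_squared(0.920, 53940, 23)
# Reduced model: p = 21 terms; with n this large the penalty barely moves
adj_reduced = adjusted_r_squared(0.920, 53940, 21)
print(round(adj_full, 4), round(adj_reduced, 4))
```

With 53,940 observations the penalty for two extra terms is negligible, which is why the full and reduced models report the same adjusted R-squared to three decimal places.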
One limitation of this model is that a large dataset is required to achieve reliable results, because the regression results are sensitive to outliers, which distort the least squares regression line and result in inaccurate predictions. Furthermore, data that are not independent are unsuitable for this model, since it assumes the observations gathered are uncorrelated with each other. Additionally, the model's multivariate normality assumption requires the residuals to be normally distributed, whereas in reality most residual distributions are only approximately normal, so some bias may be evident in the results. Using a scatter plot produced with matplotlib's plt.scatter() function, we can see the association between the diamonds' actual and predicted prices, which helps us judge how accurate the model is for this multiple linear regression problem. It also lets us gauge the level of correlation between a diamond's actual price and the model's forecast, specifically highlighting the linearity, direction, and strength of the relationship.
On the other hand, constructing residual plots with seaborn.residplot() for the predicted and actual prices shows that positive residuals indicate under-prediction, whilst negative residuals indicate over-prediction. Limitations of the model can be spotted in the residual plots when they show outliers or trends such as fanning or funnelling effects. Plotting histograms of the residuals with matplotlib's plt.hist() for both the full and reduced models helps us determine the residuals' distribution pattern (normal, skewed, bimodal, multimodal, uniform), which establishes whether the residuals satisfy the nearly normal residuals assumption. However, the number of histogram bins can affect the apparent distribution pattern, as using more bins may change the shape, and this in turn affects the accuracy and reliability of judging normality during the diagnostic checks.
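The binning caveat can be demonstrated on a toy right-skewed sample (an assumption standing in for our residuals): the fraction of data captured by the tallest bin, and hence the apparent shape, shifts with the bin count.

```python
import numpy as np

# Right-skewed toy sample; np.histogram returns counts per bin
rng = np.random.default_rng(7)
sample = rng.lognormal(mean=0.0, sigma=0.75, size=10000)

counts_coarse, _ = np.histogram(sample, bins=5)
counts_fine, _ = np.histogram(sample, bins=50)

# The modal bin dominates with coarse binning but not with fine binning,
# so the same data can look sharply peaked or gently skewed
print(f' 5 bins -> modal bin holds {counts_coarse.max() / 10000:.0%} of the data')
print(f'50 bins -> modal bin holds {counts_fine.max() / 10000:.0%} of the data')
```

This is why our diagnostic conclusions from Figures 5 and 11 should be read alongside the summary statistics (skew and kurtosis) rather than from the histograms alone.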
The backwards feature selection strategy, which assesses the p-value of each explanatory variable against a significance level of 0.05, may prevent some important variables from contributing to the results, since this approach cannot on its own distinguish genuinely less predictive explanatory variables. An alternative is to set a somewhat higher p-value threshold and to consult jewellery specialists when selecting variables, incorporating domain knowledge alongside statistical significance; this could improve the model's reliability and accuracy by allowing more relevant variables to be used. It is also important to note that backwards elimination does not allow an eliminated variable to be reintroduced later, even though such a variable could have a significant influence on the final model; its compensating strength is that it initially assesses the combined predictive ability of all variables, since it begins with all of them included in the model.
In the Data Cleaning and Preprocessing section of our Phase 1 report, we initially cleaned the data by removing any rows containing missing values, as well as outliers based on the diamond's carat. Using the isnull().sum() function from Pandas, we discovered that the dataset does not contain any missing values. Before removing the outliers, we wanted to know how many were present in the dataset. To count them, we first calculated the lower and upper quartiles of the carat feature, which gave us the interquartile range (IQR). Next, we determined the lower and upper whiskers from the interquartile range together with the two quartiles. The total number of outliers was then obtained by filtering the rows whose carat value lies below the lower whisker or above the upper whisker. After identifying the outliers, we kept only the rows whose carat values are not outliers.
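The Phase 1 procedure just described can be sketched on toy carat values (the values and the 1.5 × IQR whisker rule are assumptions for illustration; the real code operated on the full dataset):

```python
import pandas as pd

# Hypothetical carat values standing in for df['carat']
carat = pd.Series([0.23, 0.21, 0.29, 0.31, 0.40, 0.70, 0.90, 1.01, 2.50, 5.01])

# Quartiles and interquartile range
q1, q3 = carat.quantile(0.25), carat.quantile(0.75)
iqr = q3 - q1

# Whiskers at 1.5 * IQR beyond the quartiles (standard box-plot rule)
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Count the outliers, then keep only the non-outlier rows
outlier_mask = (carat < lower_whisker) | (carat > upper_whisker)
print(f'{outlier_mask.sum()} outlier(s) found')
carat_clean = carat[~outlier_mask]
```

On the toy series the two largest carat values fall above the upper whisker and are dropped, mirroring how the Phase 1 cleaning retained only non-outlier carats.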
Additionally, we decided not to remove any features from this dataset, as we believe all of them are relevant in predicting the price of a diamond. We opted not to change any of the features' data types because they already corresponded to what we had in mind. To finalise our data cleaning process, we displayed 5 random rows from the cleaned data.
Following the dataset's cleaning, we provided a wide range of graphs to illustrate the relationship between the dataset's features and the price of diamonds. The values were plotted in various graph forms, including histograms, bar charts, violin plots, box plots, boxen plots, scatter plots, bar plots, and strip plots, each comparing one, two, or three variables. For the one-variable plots, we used the cut, colour grade, clarity, and carat variables to determine the frequency of each category, along with a brief description of each graph. For the two-variable plots, we chose to compare the price of diamonds against the cut, colour grade, clarity, carat value, and length. To assess the strength of the relationship between price and carat, and between price and length, we additionally applied the Pearson correlation coefficient from the scipy.stats module. Last but not least, for the three-variable plots we compared nearly all of the features with the price, with the exception of the length, depth, width, and total depth percentage variables.
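The Pearson correlation check mentioned above can be sketched with scipy.stats.pearsonr on hypothetical carat/price pairs (the values are illustrative; Phase 1 used the full columns):

```python
import numpy as np
from scipy import stats

# Toy carat and price arrays standing in for df['carat'] and df['price']
carat = np.array([0.23, 0.31, 0.50, 0.70, 0.90, 1.01, 1.20, 1.50])
price = np.array([326, 335, 600, 1200, 2500, 3100, 4200, 6500])

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(carat, price)
print(f'Pearson r = {r:.3f} (p = {p_value:.4f})')
```

A coefficient near 1 confirms the strong positive linear association between carat and price that motivated keeping carat as the dominant predictor in both models.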
In our Phase 2 report, we first provide an overview of our full model by forming the statistical model formula for this dataset as a Python string. The formula begins with the price, which serves as the target feature of our report, and is followed by the numerical variables and the one-hot encoded categorical features. After the formula is established, we build an OLS model on the encoded features with price acting as the dependent variable. Then, in order to compare the actual and predicted prices and plot them onto graphs, we establish a new data frame that stores the actual values, the predicted values, and the residuals. Diagnostic plots are created to check the validity of the full model based on the 4 different conditions. In particular, scatter plots and histograms are used to compare the residual values, the predicted price, and the actual price of the full model. These examine the variability, distribution pattern, and independence of the residuals.
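A minimal sketch of this workflow follows, using a hypothetical toy data frame in place of the real diamonds data (the column names `price`, `carat`, and `cut` follow the report, but the values are invented and the real model includes all features):

```python
import pandas as pd
import statsmodels.formula.api as smf

# toy stand-in for the cleaned diamonds data
df = pd.DataFrame({
    "price": [400, 1500, 2800, 5000, 9000, 14000, 700, 3200],
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 0.4, 0.8],
    "cut":   ["Good", "Ideal", "Ideal", "Premium", "Ideal", "Premium", "Good", "Ideal"],
})

# one-hot encode the categorical feature, as in the report
encoded = pd.get_dummies(df, columns=["cut"], drop_first=True, dtype=float)

# build the formula string: price regressed on every other column;
# Q() quotes column names that are not plain identifiers
predictors = [c for c in encoded.columns if c != "price"]
formula = "price ~ " + " + ".join(f"Q('{c}')" for c in predictors)

model = smf.ols(formula, data=encoded).fit()

# data frame holding actual prices, predicted prices, and residuals
residuals_full = pd.DataFrame({
    "actual": encoded["price"],
    "predicted": model.fittedvalues,
    "residual": model.resid,
})
print(model.rsquared)
```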
We then proceed with backwards feature selection after running the diagnostic check on the full model. This is done to ensure that the model only contains features that are essential to it and have a meaningful impact. Any irrelevant features, defined as those with a p-value above 0.05 (i.e., not significant at the 5% level), are removed using the backwards feature selection technique. After eliminating the irrelevant variables, we generated a new OLS model to analyse any differences between the full model and the now reduced model, as well as to ensure that no remaining variables have p-values above 0.05.
After confirming that the model has no irrelevant features, we present an overview of the reduced model in the same way that we did for the full model. To check for discrepancies between the full and reduced models, we first build a new data frame to hold the actual and predicted prices of diamonds for the reduced model, together with their residuals, and then plot them as scatter plots. Finally, we conduct a diagnostic check to validate our reduced model using the same assumptions as we did for the full model: linearity, nearly normal residuals, constant variability, and independence of residuals. These diagnostic checks reveal whether the reduced model violates any of the 4 assumptions.
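The three diagnostic plots described above (residuals vs predicted price, residual histogram, and residuals in collection order) can be sketched with matplotlib. The residuals below are simulated stand-ins for the model's actual `model.resid` values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
predicted = rng.uniform(500, 15000, 300)       # stand-in for fitted values
residuals = rng.normal(scale=800, size=300)    # stand-in for model.resid

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# 1. residuals vs predicted price: should scatter evenly around 0 (constant variability)
axes[0].scatter(predicted, residuals, s=8)
axes[0].axhline(0, color="red")
axes[0].set(xlabel="predicted price", ylabel="residual")

# 2. histogram of residuals: should be unimodal and roughly symmetric
axes[1].hist(residuals, bins=30)
axes[1].set(xlabel="residual")

# 3. residuals in collection order: no trend expected if observations are independent
axes[2].plot(residuals, linewidth=0.5)
axes[2].set(xlabel="observation order", ylabel="residual")

fig.tight_layout()
fig.savefig("diagnostics.png")
```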
Performing a diagnostic check on the reduced model further confirms that the model violates the constant variability assumption even after removing 2 variables, since the residual plot against predicted price shows the residuals trending downward as the predicted price increases rather than being centred around 0. Similarly, the residual plot against actual price demonstrates the model's violation of the constant variability assumption, as the residuals display a curvature, as seen in Figure 9. On the other hand, the residual histogram shows that the reduced model satisfies the nearly normal residuals assumption, with a unimodal and nearly symmetric distribution and no outliers. In Figure 6, the scatter plot of residuals against the order in which the data were collected shows an increasing trend from order 20000 to about 28000, as mentioned previously. This may suggest anomalies within the dataset, since this is not time-series data and the diamond observations should be independent of one another.
According to the findings above, the final regression model can predict the price within approximately $\pm13750$ USD, though the model often predicted prices significantly higher than the actual price: the highest predicted price (39403 USD) is more than double the highest actual price (18823 USD), so the actual maximum is only about 48% of the predicted maximum.
Phase 1 aimed to estimate the diamond's price in US dollars based on the features established within the first report. In addition to data cleaning and preprocessing, we also explored and visualised the data using graphs and charts. The Phase 1 report's goal was to learn more about the relationships between the variables.
In Phase 1, we learned that a diamond's price increases with its weight (carat) and size. Excellent clarity, colour, and cut are additional decisive considerations, although they are not as important as the former. The depth and table features of a diamond tend to be the least important factors for determining its price. Although data visualisation gives us important insights into the relationships between the diamond's attributes, further research utilising statistical modelling is necessary to completely comprehend the properties of the data for strategically predicting the diamond's price.
In Phase 2, we examined statistical modelling of diamond price trends to further achieve our goal of predicting diamond prices. We initially prepared a Python string containing all of the independent variables from our dataset in order to fit a multiple linear regression. Using Pandas' get_dummies function, the categorical features in our dataset are one-hot encoded, and the encoded features are then added to the regression model's formula.
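The encoding step itself is a one-liner with pandas. The toy column below is hypothetical; dropping the first category (drop_first=True) avoids the dummy-variable trap, where the dummies would be perfectly collinear with the model's intercept:

```python
import pandas as pd

# toy stand-in for the dataset's categorical cut column
df = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Ideal"]})

# drop_first=True drops one category per feature to avoid the dummy-variable trap
encoded = pd.get_dummies(df, columns=["cut"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```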
After that, we looked at the relationship between the number of observations and the residual degrees of freedom. We employed an ordinary least squares (OLS) regression model to relate the predictors to the price and to compare the actual and predicted prices of diamonds. Numerous details about the dataset were included in the OLS regression results table, including the dependent variable, t-statistics, p-values, and confidence intervals. We then created a new data frame to better analyse the discrepancy between the actual and predicted diamond prices.
Then, using the assumptions of linearity, nearly normal residuals, constant variability, and independent residuals, we performed diagnostic checks. In backwards feature selection, we look for features with p-values above 0.05 in order to eliminate those that are not statistically significant. This mirrors what we performed in the Full Model Overview and Full Model Diagnostic Checks, but with the insignificant features removed.